Abstract

Fish kill events (FKEs) in the caldera lake of Taal occur rarely (on only 0.5% of days in the last
10 years), but each event has a long-term effect on the environmental health of the lake ecosystem,
as well as a devastating financial and emotional effect on the residents whose livelihoods rely
on aquaculture farming. Predicting with high accuracy when within seven days and where on
the vast expanse of the lake FKEs will strike would be a very important early warning tool for the
lake’s aquaculture industry. Mathematical models developed in several past studies to predict the
occurrence of FKEs use as predictors the physico-chemical characteristics of the
lake water, as well as the meteorological parameters above it. Some of the models, however,
did not provide acceptable predictive accuracy or enough early warning because they were
developed with an unbalanced binary dataset, i.e., one characterized by dense negative examples (no
FKE) and highly sparse positive examples (with FKE). Other models require setting up an
expensive sensor network to measure the water parameters not only at the surface but also at
several depths. Presented in this paper is a system for capturing, measuring, and visualizing
the contextual sentiment polarity (CSP) of dated and geolocated social media microposts of
residents within a 10 km radius of the Taal Volcano crater (14°N, 121°E). A high frequency of negative
CSP co-occurred with FKEs on two occasions, making human expressions viable non-physical
sensors for impending FKEs to augment existing mathematical models.


1. Introduction 

Fish kill events (FKEs) in the caldera lake of Taal in the province of Batangas occur infrequently,
with a recorded daily frequency of only 0.5% over the last 10 years. When FKEs occur, however, their
consequences for the lives of the people whose livelihoods rely on fish farming are devastating. Sensing
with higher accuracy, therefore, when in the next seven days and where in the vast expanse of
Taal Lake an FKE will occur would be an important early warning tool for the Taal Lake aquaculture
industry. The development of such a high-fidelity sensing system will be of interest not only to
practitioners of limnology, but also to those in fisheries science,
meteorology, economics, predictive analytics, and agricultural and biosystems engineering.

Predicting the occurrence of FKEs in Taal Lake is challenging to practitioners
in these fields, not only because of its non-trivial impact on the livelihood of the human population
that relies on the lake, but also because the lake is a unique subject of study compared to other
aquaculture lakes in the world. Geographically, the lake was formed over a caldera and is host to
Taal Volcano, a famous tourist attraction known as the smallest active volcano in the world,
which stands on a smaller lake within an island, which in turn is located in the middle of Taal
Lake.


FKEs in Taal Lake may not only be the result of the usual, already known physico-chemical factors
affecting most lakes worldwide, but may also be worsened by the presence of volcanic vents that
actively spew sulfuric emissions. These emissions not only result in an
extremely high temperature difference between the surface and bottom water; the jets of
high-temperature streams also mechanically stir the lake bottom. As of
this writing, no studies have been conducted that can quantitatively predict when such emissions
occur and at what magnitude, much less where these thermal vents are located on the lake
floor, despite the few attempts to map the bottom with acceptable-resolution bathymetry. Even
if high-resolution bathymetry from high-fidelity sonar technologies can be obtained for Taal Lake,
because Taal Volcano is active, new volcanic vents form while the present ones undergo
an unpredictable cycle of closing and opening depending on the various tectonic activities
(most of which are hypothesized to go undetected) that occur within the Luzon geological plate.


Despite these challenges, several studies have been made in the past to predict the Taal
Lake FKEs through predictive modeling, using as predictors the physico-chemical characteristics
of the lake water in combination with the lake’s immediate meteorological parameters
(Magcale-Macandog et al., 2012a,b, 2013a,b). Most of these models, specifically those
that used patterns mined from historical data, faced the problem of an unbalanced binary dataset,
i.e., one having dense negative daily data (no FKE) over sparse positive daily data (with FKE). Note
that the granularity of the temporal variables in these datasets was set to daily because this
is the finest granularity at which meteorological data can be obtained from the Philippine Atmospheric,
Geophysical and Astronomical Services Administration (PAGASA) Station in Ambulong, Tanauan
City, Batangas (14.083°N, 121.050°E, 10.0 MASL), the nearest and only meteorological station in
the whole province. Although the physico-chemical characteristics of the lake water at various
depths, as recorded and curated by the Bureau of Fisheries and Aquatic Resources in Region IV-A
(BFAR4A), are also at daily granularity, the readings are not actually
taken on a regular daily basis. The most critical characteristics are only measured when there is a
reported observation of the symptoms of an FKE. Thus, the physico-chemical dataset contains a lot of
missing records. Even though standard statistical regression and time-series techniques will work
for datasets with missing records, they are not appropriate for modeling patterns characterized by
dense-negative and sparse-positive binary records.
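The effect of such an imbalance can be illustrated with a small worked example. The sketch below is not taken from any of the cited models: it uses synthetic labels built only from the 0.5% daily frequency quoted above, and the always-negative "model" is a hypothetical baseline chosen to show why plain accuracy is misleading for this kind of data.

```python
# Illustrative only: synthetic labels, not the BFAR4A or PAGASA records.
DAYS = 3650                       # roughly ten years of daily records (assumption)
FKE_RATE = 0.005                  # 0.5% daily frequency quoted in the text

n_pos = round(DAYS * FKE_RATE)    # days with an FKE
n_neg = DAYS - n_pos              # days without an FKE
labels = [1] * n_pos + [0] * n_neg

# A degenerate baseline that always predicts "no FKE".
predictions = [0] * DAYS

true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
true_neg = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)

accuracy = (true_pos + true_neg) / DAYS   # share of days classified correctly
recall = true_pos / n_pos                 # share of FKE days that were caught

print(f"FKE days: {n_pos}, accuracy: {accuracy:.4f}, recall: {recall:.1f}")
```

Under these assumptions the baseline is right on more than 99% of days while catching none of the FKE days, which is why measures that focus on the positive class matter more than overall accuracy here.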


Probabilistic methods, such as the Bayesian network of models inferred from historical qualitative
and quantitative unbalanced datasets, provide an acceptable predictive rate and accuracy but are not
realistically useful in Taal because the models require a vast network of physico-chemical sensors.
Before the Bayesian network of models can be fully utilized, the government, in cooperation with the
private sector, must invest heavily in the development, operation, and maintenance of a high-resolution
sensing infrastructure that covers not only the entire surface of the lake but also
several meters of the lake depth, preferably at least five meters more than the
average depth of the fish cages. Even if the Bayesian network of models proves to be accurate and useful once
this network of sensors is in place, such an investment is projected to be costlier than the
savings the fish industry will gain from avoiding FKEs. Thus, probabilistic modeling methods that
rely on input from expensive sensor readings are not an economically worthy undertaking.

While the models developed using probabilistic means may require a high investment from the government
and stakeholders, does it mean that these kinds of undertakings, however accurate and useful
the results are, will be put to waste? Definitely not, if a new kind of sensor can be deployed at the
lake without spending much on its development, operation, and maintenance. In recent years,
non-physico-chemical sensors have emerged as a new “device” to sense the environment with
high fidelity and utmost recency. These “devices” are called “social sensors.” Social sensors measure
the apparent general sentiment of the human population where environmental, social, political, or
economic events such as earthquakes, storms, disease outbreaks, population disturbances, sporting
events, movie premieres, political campaigns, and stock market crashes are currently happening. The
population’s general sentiment can easily be read and then measured by processing the online postings
of users on the social media over the Internet. The sentiment can be computed using a network
of computers as soon as the user posts an experience, idea, or opinion. The utility of this emerging
sensor has already been exploited in the works of Aramaki et al. (2011), Fraustino et al. (2012),
Lampos and Cristianini (2012), Sakaki et al. (2010), and Li and Cardie (2013), where negative (or
positive) sentiments co-occur at the same general location (and direction) of an environmental
disaster (or happy social event).

In a past effort (Pabico and Magcale-Macandog, 2014), the user-generated microblogs from the
social media site Twitter were studied to find out whether a high frequency of words that describe
FKEs in Taal Lake co-occurs with the actual fish kills. It was found that phrases such as “maitim
na tubig” (black water), “amoy asupre” (sulfuric odor), and “nahibay na isda” (dizzy fish), among
others, were observed in microblogs originating from within the 10 km radius of the peak
of the Taal Volcano crater (14°N, 121°E). Geolocating microblogs is not difficult because most
of them are geotagged and time-stamped, especially those that were sent through GPS-equipped
smart phones. Most fish cages in Taal Lake, as well as the communities whose livelihoods rely on fish
farming, are located within this radius. Through years of experience, residents in these communities
have already compiled observable symptoms (COS) for FKEs, which they use as an adaptive
mechanism to prepare for FKEs and to warn others of the onset of an FKE in their locality. It is
very interesting to note that increases in the observed frequency of words in the COS co-occur
with the recorded FKEs, as well as with periods when the lake water quality level is deemed critical by BFAR4A,
suggesting that the frequency may be a good indicator of the onset of an FKE. However, FKEs are
so locally specific that they may strike one fish cage while nearby fish cages (even as near as 5 m)
are not affected. Further, the COS from one location may not be the same as that from another. For
example, there are fish cages that are more often affected by “amoy asupre,” suggesting thermal
vents on the lake bottom nearby (though not directly underneath), than by “berdeng tubig” (green water). Thus,
the COS in these locations does not include “berdeng tubig.” On the other hand, there are fish
cages that are affected by “berdeng tubig” rather than by “amoy asupre.” Intuitively, the COS
in these areas does not include “amoy asupre.” In either case, “nahibay na isda” was observed,
suggesting dissolved oxygen depletion. It has already become local knowledge among the residents
in the area that an FKE always follows after “nahibay na isda” is observed for an extended period of
time.


The general Twitter microposts of residents in the Taal Lake communities may not include the
COS, but this does not necessarily mean that the poster does not experience negative feelings
(e.g., anxiety) over the possibility of an FKE in their area. In fact, a huge percentage of polled
microposts from the Taal Lake area during the onset of an FKE do not contain COS. However, when
the sentiments of the microposts were measured, a huge percentage of the polled samples was
found to be negative, suggesting that anxiety, sadness, fear, and probably depression are being
experienced by the members of the polled population. Specifically, an increasing frequency of
negative contextual sentiment polarity (CSP) of microposts was observed 1-6
days prior to an FKE during the reported events in February 2013 and in January 2014. During
times when symptoms of FKEs are not reported, the general sentiments of microposts range from
neutral to positive (e.g., optimistic, happy, and elated). This suggests that there is evidence
that FKEs in Taal Lake and negative CSP in the microposts of Taal Lake residents co-occur. If the
frequency of negative CSP increases, one may be able to prepare for an FKE to occur within six
days.


2. Review of Literature 

2.1. The Social Media 


The advent of the so-called social media over the Internet has impacted the way people live in the
digital age. The social media has become a ubiquitous tool for people to meet, communicate, and
collaborate through sites such as Flickr, YouTube, and Instagram (Adelson and Rose, 2004;
Butterfield and Fake, 2004; Chen et al., 2005; Page and Brin, 1998; Systrom and Krieger, 2010;
Zuckerberg et al., 2004). Figure 1 shows
one of the many visualizations of the inferred conceptual framework of the current social media as
utilized by and from the point of view of an account owner (Hayes, 2008).



Figure 1: A popular visualization of the conceptual framework of the social media by Hayes (2008),
out of the many visualizations that exist. Here, the social media is visualized as a social
web.


The exponential proliferation of social media sites offering varied services and tools for meeting,
communicating, and collaborating with other people has significantly transformed the lives of the
so-called connected people. Now, the social media sites are expected to function as real-time,
on-site sources of trusted information, as well as a medium for reporting events as they happen.
For example, in the recent natural disasters that hit the Philippines, particularly the 2013 Bohol
Earthquake, Haiyan (Typhoon Yolanda), the strongest tropical cyclone ever recorded to make landfall,
and the tropical cyclone Rammasun (Typhoon Glenda) (Fischetti, 2013; Larano, 2014;
Mangosing, 2013; Marquez, 2013; Masters, 2013), people in the social web extensively used
the services either as sources of breaking news or as a means of sharing information that they
had experienced first hand. The services operate in real time, so the social web gives users the
capability to share their experiences as the event happens, often validated by
different people who happen to experience the same event (Nagar et al., 2012). These seemingly
real-time and interactive capabilities were not quite possible using the technologies of broadcast
communication, where information only flows from the broadcaster to the target audience.

The trust that people give to the social media as a source of real-time and reliable information
has already equaled, if not surpassed, the trust that they give to the usual authoritative sources of
real-time information like television and broadcast radio. In fact, adults in the U.S. have
reported that the Internet was their preferred source of information, and they ranked it as the most
reliable source of news (Zogby Interactive, 2009). The public’s online activities increase during
disasters because people have increasingly turned to the social media for the most up-to-date
information. For example, Nagar et al. (2012) reported that the social media site Twitter
was the most active site during the 2011 Japanese tsunami, where users worldwide posted more
than 5,500 tweets per second just after the disaster struck. Additionally, it was also observed that
there is an increase in the presence of non-journalists who practice news reporting during natural
calamities (Caragea et al., 2014). These are average citizens who are present at the location of the calamity
and report happenings through the social media. Not only is the social media used for seeking and
sharing information during disasters but, in addition, users expect emergency managers to respond
to victims through real-time monitoring of social media posts. This is echoed by 75% of the
1,078 respondents surveyed by the American Red Cross in 2010, who said that they “expected help
to arrive within an hour if they posted a request on a social media site.”


2.2. Microblogs 


One of the most important attributes of the social media that people at large have increasingly
relied on in recent years is its capability to allow users to post any information in real
time. Although the social media has services that allow sharing of events in visual forms through
the posting of amateur video clips (for example, in YouTube), of photographs taken with hand-held
cameras (e.g., in Instagram), or of computer documents (e.g., in Google+), “any” information also
includes personal experiences of the users that are often expressed as written texts. The social
media site Twitter (Dorsey et al., 2006) is one of the services on the Internet that allows the posting
of short texts that are often updated by users several times a day. The posts, each containing
up to 140 characters and aptly called a microblog or tweet, often reflect the uninhibited
spur-of-the-moment urge of a user to verbalize in written form a current personal experience. Aside
from tweets, Twitter allows the sharing of micromedia like photographs, video clips, and audio
clips (Sakaki et al., 2010).


Because of the uninhibited nature of the microblogs, other users who have particularly close ties to
a user U often come back to the social media site several times a day to check what U is doing
and what U is thinking about now. Oftentimes, the user U is a celebrity like the internationally
renowned singer Katy Perry, US President Barack Obama, and Pope Francis of the Roman Catholic
Church’s Holy See, whose followers number in the millions. In fact, as of early
January this year, they respectively had 49.0M, 40.7M, and 3.0M followers (Pabico, 2014), but a
report ten months later (van Zanten, 2014) shows that they already have 60.7M, 50.5M, and
4.8M followers. For some personalities in the show, entertainment, and advertising businesses, the
number of Twitter followers correlates with success in product endorsements, which necessitates
frequent daily updates of the microblogs to keep the followers, if not to maintain a positive growth
in their number.

Second only to personal sentiments, a large number of microblog updates are
reports related to social events, such as parties, sports, and political campaigns. Some events
include disasters such as storms, fires, traffic jams, riots, inundations from heavy rainfall, and
earthquakes. In fact, researchers have documented that Twitter has been used for
various real-time disaster response mobilization activities, especially by users seeking help during
a large-scale fire emergency and for live traffic updates. Adam Ostrow, the Editor in Chief of the social
media news blog Mashable, opined the following about the interesting phenomenon of the real-time
media:


Earthquakes are one thing you can bet on being covered on Twitter first, because, quite
frankly, if the ground is shaking, you’re going to tweet about it before it even registers
with the USGS and long before it gets reported by the media.


This observation has been the motivation behind several independent studies that used a new method
called crowdsourcing to “sense” the environment not through any physico-chemical device but through
what are called social sensors. Crowdsourcing is the process of obtaining sensed data from
seemingly innumerable independent and unrelated humans, while social sensors are humans who
are currently experiencing a disaster. Examples of studies that reported success in using
crowdsourced social sensors are identifying the rate of contagion in flu epidemics (Aramaki et al., 2011;
Li and Cardie, 2013), predicting the epicenter of earthquakes (Sakaki et al., 2010), and nowcasting
the trajectories of storms (Fraustino et al., 2012). Nowcasting, a contraction of now and forecasting,
is the forecasting of observable physical phenomena, such as the weather and the economy, within
the next six hours with reasonable accuracy (Lampos and Cristianini, 2012).


With the growing number of people residing near different disaster-prone areas, the need for
a tailor-fit, updated, and fast warning and response system is increasing (Caragea et al., 2014;
Mandel et al., 2012), and Twitter is one of the tools being used for this purpose. Its real-time
nature makes it a good outlet for reporting and sharing disaster-related news, and
even for asking for help (Caragea et al., 2014; Li and Cardie, 2013; Sakaki et al., 2010). Because of its
widespread utility, not only do news agencies use Twitter to broadcast news, but
science researchers also analyze the population sentiment during disasters using Twitter’s
extensive data stream as input (Pak and Paroubek, 2010).


2.3. Sentiment Analysis 


Sentiment Analysis (SA), also called opinion mining, is the use of technologies developed in
Natural Language Processing (NLP), a subfield of Computer Science, to automatically analyze (usually
written) texts and identify the prevailing sentiments in these texts. The sentiments identified are
attitudes, emotions, or opinions towards a certain observed phenomenon. Recently, SA has been
utilized in a number of disciplines like business, finance, and politics. In finance, it is used to
measure the investment mood of people (Sanchez-Rada et al., 2014), while in politics, it is used to
determine the pulse of the voters and predict a possible winner of an election (Bravo-Marquez et al.,
2013).


There are two methods commonly used to categorize sentiments for SA: machine learning and
lexicon-based. Machine learning (ML) is the use of a combination of various computational
intelligence techniques that allow a computer to learn patterns from examples (Kohavi and Provost,
1998; Kruse et al., 2013). The computational pattern to learn in this case is the classification of the
contextual polarity of texts into either positive or negative sentiments (Shalunts et al., 2014).
Sometimes, a neutral category is added (Koppel and Schler, 2006). Examples of computational
techniques that are commonly used for SA are Naive Bayes (Gamallo and Garcia, 2014), Support Vector
Machines (Mullen and Collier, 2004), and Artificial Neural Networks (Sharma and Dey, 2012). On
the other hand, the lexicon-based method uses a set of words called a sentiment dictionary or lexicon,
wherein each word is associated with either a positive, neutral, or negative sentiment (Taboada et al.,
2011). When a word w is encountered, the sentiment dictionary is used as a look-up table for
assigning a CSP to w (Palanisamy et al., 2013). This makes the lexicon-based method easier to
implement than the machine learning ones. Because the two methods are not mutually exclusive, it is
not difficult to combine them to come up with a better SA, as evident in the work of Zhang et al. (2011).

Formally, SA is a transformation $T$ such that $T(w) \to s \in \{-, 0, +\}$, where $T(w)$ is the
transformation of a word $w$ to a CSP $s$, which is either negative ($-$), neutral ($0$), or positive ($+$). Note here
that $T$ is a transformation function that could be generated by any ML heuristic, by a look-up
table based on a lexicon of words, or by a combination of both. The transformation $T$ is oftentimes
the core function of an automated classifier $C$ which accepts as input a sentence or phrase $I$, and
outputs $O$ as the sentiment of $I$. Converting from $I$ to $O$ includes the following steps:

1. Partition $I$ into parts of a sentence or a set of words $W = \{w_1, w_2, \ldots, w_n\}$ using an NLP
technique (Groh and Hauffa, 2011; Liu, 2010);


2. Use $T$ on each $w_i \in W$, $\forall i = 1, 2, \ldots, n$, to obtain a set of CSP $S = \{s_i \mid i = 1, 2, \ldots, n\}$;

3. Provide a final CSP score over $S$ using some function, the simplest of which is an averaging
one: $S_{\text{mean}} = n^{-1} \sum_{i=1}^{n} s_i$; and

4. Return $S_{\text{mean}}$ as $O$.
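The four steps above can be sketched as a small lexicon-based classifier. This is a minimal illustration, not the classifier used in this work: the toy English lexicon, the regular-expression tokenizer, and the numeric encoding of negative, neutral, and positive as -1, 0, and +1 are all assumptions made so that the averaging step is well defined.

```python
import re
from typing import List

# Hypothetical toy lexicon: word -> CSP encoded as -1, 0, or +1 (assumed encoding).
LEXICON = {
    "happy": 1, "good": 1, "calm": 1,
    "sad": -1, "afraid": -1, "dead": -1, "worried": -1,
}


def T(word: str) -> int:
    """Look-up-table transformation of a word to a CSP; unknown words are neutral."""
    return LEXICON.get(word.lower(), 0)


def partition(text: str) -> List[str]:
    """Step 1: a naive tokenizer standing in for a proper NLP technique."""
    return re.findall(r"[A-Za-z']+", text)


def C(text: str) -> float:
    """Steps 2 to 4: apply T to every word and return the mean CSP."""
    words = partition(text)
    if not words:
        return 0.0                      # no words, treated here as neutral
    scores = [T(w) for w in words]      # the set S of per-word CSPs
    return sum(scores) / len(scores)    # S_mean, returned as O


print(C("The fish are dead and I am worried"))
```

In this sketch a sentence with two negative words out of eight scores -0.25. A trained ML model could replace the body of `T` without changing `C`, which reflects the remark that the two methods are not mutually exclusive.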


2.4. The Twitter Tweet API and Its Utility 


The Twitter Tweet Application Programming Interface (TAPI) is a set of computer commands
provided by the Twitter developers for the exclusive use of programmers, allowing them to tap into
the Twitter data stream and gather tweets from a specific timeframe and geolocation (Dorsey et al.,
2006). Nowadays, most tweets that are sent from mobile smart phones are already geotagged
because of the device’s capability to receive geo-positional data from Global Positioning System
(GPS) satellites. Once the streamed tweets have been collected by a computer program that uses
the TAPI, they become the inputs $I$ to an automated classifier $C$, which outputs the CSP $O$ of
the tweets.

With the timestamp $t$, the geolocation $g = (x, y)$, where $x$ and $y$ are respectively the longitude
and latitude, and the CSP $O$, a tweet can be formalized as a triple $(t, g, O)$. A collection of these triples
over an area within some time interval can be plotted in a time-evolving Geographic Information
System (GIS) to visualize the dynamics of a population’s sentiment. This idea resulted in various
applications that track typhoons, diseases, and other natural phenomena (Aramaki et al., 2011;
Fraustino et al., 2012; Lampos and Cristianini, 2012; Li and Cardie, 2013; Mandel et al., 2012;
Sakaki et al., 2010).
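The triple formalization lends itself to a simple record type that can be aggregated over time before plotting. The sketch below is only one possible representation: the `TweetRecord` name, the daily aggregation, and the sample values are hypothetical and are not drawn from the collected data.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, NamedTuple, Tuple


class TweetRecord(NamedTuple):
    """One tweet as the triple (t, g, O)."""
    t: datetime                  # timestamp
    g: Tuple[float, float]       # geolocation (x, y) = (longitude, latitude)
    O: float                     # CSP score of the tweet


def daily_negative_fraction(records: Iterable[TweetRecord]) -> Dict[date, float]:
    """Return, for each calendar day, the fraction of tweets with negative CSP."""
    total: Dict[date, int] = defaultdict(int)
    negative: Dict[date, int] = defaultdict(int)
    for r in records:
        day = r.t.date()
        total[day] += 1
        if r.O < 0:
            negative[day] += 1
    return {day: negative[day] / total[day] for day in total}


# Synthetic sample values for illustration only.
SAMPLE = [
    TweetRecord(datetime(2014, 1, 10, 8, 0), (121.00, 14.00), -0.5),
    TweetRecord(datetime(2014, 1, 10, 21, 0), (121.01, 14.02), 0.25),
    TweetRecord(datetime(2014, 1, 11, 9, 30), (120.99, 13.98), -1.0),
]

print(daily_negative_fraction(SAMPLE))
```

A per-day series like this is one convenient input for a time-evolving map layer, since each day's value can be attached to the locations `g` of the tweets that produced it.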

2.5. Real-time Tracking of Natural Disasters and Diseases


A number of computer applications using the TAPI have been devised to track natural disasters and
diseases in near real-time. The earliest reported application is that of Sakaki et al. (2010), in which
the computer system monitors the tweets of Japanese users and detects with high probability
mentions of earthquake experiences in the tweets. The system notifies Twitter users in
Japan much faster than the broadcast warnings of the Japan Meteorological Agency (JMA). The
basic idea of the application is that each Twitter user acts as a “social sensor” which (or who)
detects an event (i.e., an earthquake) and reports the experience with a certain probability. Since a
tweet includes the timestamp $t$ and the geolocation data $g$, the computer system can infer the
trajectory of the shock propagation of an earthquake over an area and can send an early warning to
the potentially affected population.

A similar system, which detected not only mentions of earthquake events but also the CSPs
of online posts in both Twitter and Facebook, was subsequently created by Doan et al. (2012) and
then further improved by Vo and Collier (2013). In the work of Doan et al. (2012), 1.5 million
online posts from 9 March 2011 to 31 May 2011 were investigated to track the awareness and
anxiety levels of the residents of the Tokyo Metropolitan District in relation to the 2011 Tohoku
Earthquake and the tsunami and nuclear emergencies that followed. An improved system was used
in the work of Vo and Collier (2013), where they also detected other emotions (e.g., unconcerned,
concerned, calm, unpleasant, sad, fear, and relief) in addition to the anxiety of the population
affected by the Japan earthquakes of 11 March, 7 April, 4 April, and 10 July, all in 2011. The
studies found that SA of online posts relating to a sequence of disasters (earthquake, tsunami,
and nuclear emergency) is a good early warning system for the target population and a useful
resource for tracking the dynamics of the population’s general sentiment. The studies emphasized
the resiliency of the Japanese people in facing a series of disasters, as the SA showed the anxiety
and fear levels of the population quickly returning to normal within the day after the disasters.

Starting from the reported system of Sakaki et al. (2010), several other researchers followed in
tracking the sentiment of populations affected by natural disasters using a combination
of the Twitter API and SA, such as those of Hurricane Irene in August 2011 (Mandel et al., 2012),
of Hurricane Sandy in October 2012 (Caragea et al., 2014), of the flooding in the Philippines
caused by the Southwest monsoon (or Habagat) in August 2012 (Lee et al., 2013), and of the
flooding in Germany and Austria in June 2013 (Shalunts et al., 2014). Not only natural disasters
can be tracked but also man-made crises such as the gas pipe explosion of September 2010 in
San Bruno, California (Nagy and Stamberger, 2012). Similar systems were also created to track
disease epidemics, such as the surveillance of influenza-like illnesses in several regions of the United
Kingdom (Lampos et al., 2010), of the pandemic outbreak of H1N1 in 2009 (Chew and Eysenbach,
2010), and of several cases of Dengue epidemics in Brazil (Gomide et al., 2011). These studies
found that tweets about a certain event (disaster or epidemic) originate from different locations,
and during the peak of the event, the CSPs of the tweets are at their most negative and concentrated
within the location where the event strikes. After the peak of the event, the CSPs of tweets connected
to the disaster vary at different locations (Caragea et al., 2014). Nagar et al. (2012) observed that
the tweet sentiments cluster, while Vo and Collier (2013) noticed that the CSPs went back up to
normal within 24 hours. Both studies agree, however, that the strength of the CSPs is mostly based
on the tweets’ distance from the event. Most researchers opined that these CSPs are helpful in
building social awareness of the phenomenon.

2.6. Twitter and Taal Lake FKE


Building on the various works mentioned above and the perceived significant utility of Twitter
microposts as “social sensors” for FKEs in Taal Lake, a preliminary study was conducted to test
whether certain tweet features co-occur with a reported fish kill (Pabico and Magcale-Macandog,
2014). Words that are included in the COS, such as amoy asupre, berdeng tubig, maitim na tubig,
and nahibay na isda, were observed to occur with high frequency in tweets whose
geolocation is within the 10 km radius of the peak of Taal Volcano (14°N, 121°E). The frequency
of occurrence increased above normal within one week before the two officially reported FKEs in
February 2013 and in January 2014. This suggests that the Twitter users within the Taal
Lake area whose livelihoods depend on fish farming are talking (and probably reporting) to
each other through Twitter about the observed symptoms of fish kills. The more the fisherfolk talk
about the symptoms, the more frequently they have observed them, and one can then
infer that the onset of an FKE is already happening.


This current work builds on the earlier work of Pabico and Magcale-Macandog (2014) and
improves it with the computation of the CSPs of tweets. The earlier work provided a basis for
considering the frequency of words in the COS as an indicator of FKEs. However, not all tweets
in the area include words in the COS. Most of the tweets contain quantifiable sentiments that
could show with high certainty that, as the polarity of the tweets becomes more negative, the
residents are becoming anxious as a result of their direct observation of, and the
perceived water quality at, the lake surface.


3. Methodology 


3.1. Programming a Scraper with TAPI 


A computer program, termed here pScraper, was written to automatically scrape the Twitter 
social networking site for tweets originating from within the 10km radius of the Taal Volcano crater 
(at geo-location 14°N, 121°E). Scraping is the process of extracting pertinent data from web pages 
obtained by crawling the Internet. Crawling a set $W_n$ of $n$ web pages, 
$W_n = \{p_0, p_1, \ldots, p_{n-1}\}$, means downloading the subset 
$W_{n-1} = \{p_1, p_2, \ldots, p_{n-1}\} \subset W_n$ of web pages given the initial web page $p_0$. 
From the uniform resource locator (URL) links in the hypertext markup language (HTML) anchor tags 
found in a web page $p_i$, the next web page $p_j$ can be obtained and its data scraped, 
$\forall p_i, p_j \in W_n$, $i \neq j$. 
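
The crawling definition above can be illustrated with a minimal sketch. It is not the pScraper code (which was written in Perl); it assumes a breadth-first visiting order and a caller-supplied `fetch` function, both of which are illustrative choices, and it runs against a three-page toy web invented for the example.

```python
from collections import deque
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl starting from the initial page p_0.

    fetch is a hypothetical helper mapping a URL to its HTML text;
    a real crawler would issue HTTP requests here.
    Returns the pages in the order they were downloaded.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        order.append(url)
        parser = AnchorParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # download each page only once
                seen.add(link)
                queue.append(link)
    return order


# Toy in-memory web of three pages (invented for illustration).
TOY_WEB = {
    "p0": '<a href="p1">one</a> <a href="p2">two</a>',
    "p1": '<a href="p2">two</a>',
    "p2": '<a href="p0">back</a>',
}

pages = crawl("p0", TOY_WEB.__getitem__)
```

Starting from `p0`, the sketch reaches the remaining pages only through anchor links, which is the sense in which $W_{n-1}$ is obtained from $p_0$.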


Twitter provided TAPI to allow for the automatic scraping of Twitter pages (Dorsey et al., 2006). 


The automatic scraping of tweets originating from within the 10km radius of 14°N, 121°E was 
performed from 03 February 2013 to 30 January 2014. The Twitter posts collected are from 27 January 
2013 to 30 January 2014 (369 days), during which only two FKEs were recorded: one on 02 
February 2013 and another on 16 January 2014. 


TAPI allows for the collection of Twitter posts up to seven days old (Dorsey et al., 2006), and 
so posts from before the 02 February 2013 FKE were obtained at the start of the scraping process. 
The code for pScraper was written in Perl V5.10.1. 
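
The date arithmetic behind the collection window can be checked with a few lines. This is only a worked calculation using the dates stated in the text; the variable names are illustrative.

```python
from datetime import date, timedelta

SCRAPE_START = date(2013, 2, 3)   # first pScraper run (from the text)
SCRAPE_END = date(2014, 1, 30)    # last pScraper run (from the text)
LOOKBACK = timedelta(days=7)      # how far back TAPI could reach (from the text)
FIRST_FKE = date(2013, 2, 2)      # first recorded FKE (from the text)

# Oldest post that the very first run could still retrieve.
earliest_post = SCRAPE_START - LOOKBACK

# Inclusive number of days covered by the collected posts.
days_covered = (SCRAPE_END - earliest_post).days + 1

# The first FKE precedes the first run but falls inside the look-back window.
first_fke_captured = earliest_post <= FIRST_FKE < SCRAPE_START
```

Under these dates the earliest retrievable post falls on 27 January 2013 and the window spans 369 days, which matches the figures reported above.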
3.2. Collecting and Archiving Tweets 

Using pScraper, a little over 3.4M different tweets were collected. These tweets were 
sent by 62,569 unique Twitter account users during the scraping period from 03 February 2013 
to 30 January 2014. A microcomputer platform with a 2-core 2GHz 686-based processor, 1GB 
shared primary memory, 80GB secondary memory, a 100Mbps Ethernet-based network interface, and 
Version 6.4 of the Scientific Linux distribution of the 64-bit Gnu/Linux operating system with the 
Gnome 3.8.4 XWindows Manager was used to run pScraper. Since tweets can be sent by the account 
users at any time, pScraper was scheduled to run automatically through the standard Gnu/Linux 
scheduler Cronie V1.4.4 every 15min, for a total of 35,424 runs (i.e., 369 days × 24 hrs/day × 
4 runs/hr) during the scraping period. Because of the enormous amount of collected data, all tweets, 
including their metadata such as the timestamp t, geolocation g, and Twitter account user U, were 
saved in a relational database management system (RDBMS) for more efficient data archiving and 
processing. The RDBMS used to archive all 3.4M tweets, including each tweet’s metadata, is MySQL 
V5.1.69. 
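
A minimal sketch of such an archive follows. It is written under several assumptions: SQLite (through Python's standard `sqlite3` module) stands in for MySQL so that the example is self-contained, the table and column names are hypothetical, and duplicate tweets returned by overlapping 15-minute runs are assumed to be skipped by tweet identifier. None of these details is stated in the paper.

```python
import sqlite3

# 15-minute schedule over the 369-day window (figures from the text).
EXPECTED_RUNS = 369 * 24 * 4


def open_archive(path=":memory:"):
    """Create the (hypothetical) tweets table and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tweets (
            tweet_id  TEXT PRIMARY KEY,  -- assumed unique per tweet
            user      TEXT NOT NULL,     -- account user U
            posted_at TEXT NOT NULL,     -- timestamp t, ISO 8601
            lat       REAL,              -- geolocation g
            lon       REAL,
            body      TEXT NOT NULL,
            csp       REAL               -- filled in later by the classifier
        )
        """
    )
    return conn


def archive(conn, rows):
    """Insert rows, skipping tweet ids already stored.

    Returns the number of newly stored tweets.
    """
    before = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO tweets "
        "(tweet_id, user, posted_at, lat, lon, body) VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    return after - before


# Two overlapping runs with invented rows: tweet "2" is returned twice.
conn = open_archive()
first_run = archive(conn, [
    ("1", "user_a", "2013-02-01T08:00:00", 14.0, 121.0, "maitim na tubig"),
    ("2", "user_b", "2013-02-01T08:05:00", 14.0, 121.0, "berdeng tubig"),
])
second_run = archive(conn, [
    ("2", "user_b", "2013-02-01T08:05:00", 14.0, 121.0, "berdeng tubig"),
    ("3", "user_c", "2013-02-01T08:20:00", 14.0, 121.0, "amoy asupre"),
])
```

Keying the table on the tweet identifier keeps repeated runs idempotent, which matters when the scheduler fires far more often than new tweets necessarily arrive.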


3.3. Computing the Sentiment Polarity of Tweets 


A classifier (called here pClass) using a lexicon-based sentiment analyzer, with the English 
lexicon coming from the work of Taboada et al. (2011), was created. Filipino words and their 
corresponding sentiments were added manually to the lexicon. Each tweet in the database 
was fed as input to pClass, and the output of pClass was saved in the same RDBMS record as 
the input. Like pScraper, pClass was written in Perl V5.10.1. The program pClass was run every 
midnight during the scraping period for a total of 369 runs. Thus, right after midnight of any 
day $d_i$ during the scraping period, the CSP of the new tweets collected during day $d_{i-1}$ was 
computed, and the CSP of all newer tweets added during day $d_i$ was computed in the first 
several minutes of day $d_{i+1}$. 
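
A lexicon-based scorer of this kind can be sketched as follows. The sketch is not pClass: the lexicon entries and their scores are invented, the tokenizer is deliberately crude, and the tweet score is assumed to be the plain mean of the scores of the words found in the lexicon, with tweets containing no lexicon word treated as neutral.

```python
import re

# Toy bilingual lexicon; words and scores are invented for illustration.
TOY_LEXICON = {
    "good": 3.0,
    "fresh": 2.0,
    "bad": -3.0,
    "patay": -4.0,   # Filipino: dead
    "mabaho": -3.0,  # Filipino: foul-smelling
}


def polarity(text, lexicon=TOY_LEXICON):
    """Mean lexicon score of the words in text; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return 0.0  # treated as a neutral (non-sentiment) tweet
    return sum(scores) / len(scores)


negative = polarity("Mabaho at patay ang isda")  # mean of -3.0 and -4.0
positive = polarity("good morning")              # only "good" matches
neutral = polarity("hello")                      # no lexicon word
```

A look-up table keeps the classifier cheap enough to run as a nightly batch over a full day of tweets, at the cost of ignoring negation and word order.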


3.4. Temporal Analysis of CSP 

The temporal variation of CSP was analyzed using several time frames: hourly, daily, weekly, and 
monthly. Since the timestamp of each tweet is saved with its CSP in the RDBMS, the data for 
each of these time frames were extracted through simple SQL queries to the RDBMS. 
The hourly analyses (hourly total and average) were conducted to see whether there is diurnal 
variability in the sentiment and to capture time-of-day effects in the sentiment of the Twitter 
account owners. The daily, weekly, and monthly analyses (also total and average) were conducted 
to see seasonal variability in the sentiment at the respective granularities. The data were input 
to GnuPlot V4.2.6, the standard data visualization program of Scientific Linux, which produced the 
temporal plots for the various time frames. This paper, however, presents only the daily analysis 
due to space constraints. The analyses for the hourly, weekly, and monthly time frames will be 
presented in the archival publication version of this paper. 
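
The grouping into time frames can be sketched in a few lines. The paper performs this step with SQL queries; the version below does the same grouping in memory, and the bucket labels, function names, and sample records are all illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime


def bucket_key(ts, frame):
    """Map a timestamp to its bucket label for the given time frame."""
    if frame == "hourly":
        return ts.hour  # hour of day, pooled across all days
    if frame == "daily":
        return ts.date().isoformat()
    if frame == "weekly":
        iso = ts.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    if frame == "monthly":
        return ts.strftime("%Y-%m")
    raise ValueError(f"unknown time frame: {frame}")


def aggregate(records, frame):
    """Return {bucket: (total CSP, average CSP)} for (timestamp, csp) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, csp in records:
        key = bucket_key(ts, frame)
        sums[key] += csp
        counts[key] += 1
    return {key: (sums[key], sums[key] / counts[key]) for key in sums}


# Invented sample records: (timestamp, CSP).
records = [
    (datetime(2013, 2, 1, 8, 0), -2.0),
    (datetime(2013, 2, 1, 20, 0), -4.0),
    (datetime(2013, 2, 2, 8, 0), 1.0),
]

daily = aggregate(records, "daily")
hourly = aggregate(records, "hourly")
weekly = aggregate(records, "weekly")
```

Pooling the hourly buckets across days is what exposes a time-of-day effect, while the daily, weekly, and monthly buckets keep the calendar position.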

The FKE data for the scraping period were superimposed on the various temporal plots to 
see visually whether a pattern in CSP, for any time frame, has an observable co-occurrence with the 
FKE. Only visual analysis was performed, and it was deemed sufficient because the number of 
FKE records is very small, i.e., two occurrences: one on 02 February 2013 and another on 16 January 
2014. A more sophisticated analysis involving machine learning techniques for mining big data and 
conducting analytics will be used in the future once enough FKE data covering the same time range 
as the tweet collection are available. 


4. Results and Discussion 

4.1. The Scraper 

To collect data seamlessly from Twitter through TAPI, pScraper calls Twurl via a Scientific 
Linux system call. Twurl is a Python-based program that connects to Twitter through 
various Internet data communication protocols and uses TAPI to obtain tweet data. Twurl returns 
to pScraper a JSON-formatted response from Twitter, which pScraper reads using Perl’s JSON 
library. The collected tweets are then archived by pScraper to the RDBMS via Perl’s DBI library. 
Figure 2 shows the conceptual relationship and data exchange among the computer programs 
pScraper, Perl, Twurl, Python, JSON, and TAPI. The figure represents two conceptual relationships 
(vertical and horizontal) between any two machines (actual or virtual). A vertical relationship 
means that the machine represented by the block at the bottom runs the machine directly 
above it, while a horizontal relationship between any two machines means that data exchange is 
possible between them. For example, the microcomputer system runs Scientific Linux, which in turn 
runs Cronie, which in turn runs Perl, which in turn runs pScraper, while pScraper may exchange data 
with the RDBMS. 


[Figure 2 diagram. Blocks: RDBMS, pScraper, DBI, JSON, Perl, Cronie, Twurl, TAPI, Python, 
Scientific Linux V6.4, Microcomputer System, Twitter. Legend: Operating System, Primary Programs, 
Secondary Programs, Library, Internet Services, Written Code.] 

Figure 2: The conceptual relationship and data exchange among pScraper, Perl, Twurl, Python, 
JSON, and TAPI: (1) pScraper calls Twurl via a system call; (2) Twurl uses TAPI to 
connect to Twitter via various Internet communication protocols; (3) Twitter uses TAPI 
to return JSON-formatted tweet data through various Internet protocols; (4) Twurl 
returns to pScraper the JSON-formatted tweet data; and (5) pScraper sends the tweet 
data to the RDBMS for archiving through DBI. 
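
Steps (1) and (4) of the figure can be sketched as follows. This is an illustration rather than the pScraper code: it assumes that the `twurl` command is installed and authorized, that it takes the request path as its argument, and that the response follows the layout of the version 1.1 search resource (a `statuses` list whose items carry `id_str`, `text`, `created_at`, `user.screen_name`, and a GeoJSON `coordinates` point). The paper does not say which TAPI resource was requested, so the example path is hypothetical.

```python
import json
import subprocess

# Hypothetical request path; the actual resource used is not stated in the paper.
EXAMPLE_PATH = "/1.1/search/tweets.json?q=taal&geocode=14,121,10km"


def fetch_raw(path):
    """Run twurl as an external command and return its JSON text (step 1 and 4)."""
    result = subprocess.run(
        ["twurl", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_statuses(raw):
    """Flatten the assumed response layout into rows ready for archiving."""
    payload = json.loads(raw)
    rows = []
    for status in payload.get("statuses", []):
        point = status.get("coordinates") or {}
        # GeoJSON points are ordered longitude first, then latitude.
        lon, lat = (point.get("coordinates") or [None, None])[:2]
        rows.append({
            "id": status["id_str"],
            "user": status["user"]["screen_name"],
            "created_at": status["created_at"],
            "lat": lat,
            "lon": lon,
            "text": status["text"],
        })
    return rows


# Canned response (invented) so the parser can be exercised offline.
SAMPLE = json.dumps({
    "statuses": [
        {
            "id_str": "1",
            "text": "maitim na tubig",
            "created_at": "Sat Feb 02 01:00:00 +0000 2013",
            "user": {"screen_name": "demo_user"},
            "coordinates": {"type": "Point", "coordinates": [121.0, 14.0]},
        },
        {
            "id_str": "2",
            "text": "hello",
            "created_at": "Sat Feb 02 01:05:00 +0000 2013",
            "user": {"screen_name": "other_user"},
            "coordinates": None,
        },
    ]
})

rows = parse_statuses(SAMPLE)
```

In a live run, `parse_statuses(fetch_raw(EXAMPLE_PATH))` would replace the canned sample. Keeping the external call and the parsing in separate functions lets the parsing be tested without network access.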


With pScraper, the scraping process results not only in gathering the raw data from the Twitter 
posts but also in identifying the relationships between any two pages $p_i$ and $p_j$, 
$p_i, p_j \in W_n$, $\forall i \neq j$. Identifying the relationships between any two pages through 
their respective HTML anchors allows the topology of the network $\mathcal{N}(W_n, L)$ of web pages 
to be inferred. Thus, from $W_n$, the set of links $L = \{(i, j) \mid p_i, p_j \in W_n\}$ can be 
obtained. 
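
As a small illustration of how the link set might be assembled, the sketch below assumes that the anchors found on each page have already been collected into a dictionary (a hypothetical intermediate structure), indexes the pages in the order they appear, and keeps only links between distinct pages inside the crawled set.

```python
def link_set(outlinks):
    """Build L = {(i, j)} from a mapping of page -> list of linked pages.

    Pages are numbered in dictionary order; links to pages outside the
    crawled set and self-links are dropped.
    """
    index = {url: i for i, url in enumerate(outlinks)}
    return {
        (index[source], index[target])
        for source, targets in outlinks.items()
        for target in targets
        if target in index and target != source
    }


# Invented three-page example; "external" is not part of the crawled set.
TOY_OUTLINKS = {
    "p0": ["p1", "p2"],
    "p1": ["p2", "external"],
    "p2": ["p0", "p2"],
}

links = link_set(TOY_OUTLINKS)
```

The resulting pairs are directed, so a link from page 0 to page 2 and a link from page 2 to page 0 are recorded separately.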
4.2. The Collected Tweets 


Figure 3 shows the daily number of tweets collected by pScraper and archived to the RDBMS during 
the scraping period. A total of 3,423,413 tweets were gathered, of which 58% are non-sentiment 
tweets (i.e., pClass identified them as neutral tweets) and the remaining 42% are sentiment tweets. 
These tweets were sent by 62,569 distinct users. On average, each user sent about 55 tweets over 
the whole collection period. 
The minimum number of daily tweets is 2,269 which was sent by 1,803 unique users on 16 August 
2013. The maximum number of tweets was sent on 18 June 2013 when a total of 15,995 tweets 
was reportedly sent by 2,804 distinct users. The average daily number of tweets is 9,278 (with 


[Figure 3 plot. x-axis: Scraping Period (2013-2014); the 02 Feb FKE is marked.] 


Figure 3: The daily total number of tweets collected by pScraper. 
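
The reported totals can be cross-checked with quick arithmetic. The inputs below are the figures given in the text; the variable names are illustrative.

```python
TOTAL_TWEETS = 3_423_413  # tweets gathered (from the text)
UNIQUE_USERS = 62_569     # distinct account users (from the text)
DAYS = 369                # length of the collection window (from the text)

# Tweets per user over the whole window, about 54.7.
tweets_per_user = TOTAL_TWEETS / UNIQUE_USERS

# Tweets per day across all users, about 9,277.5.
tweets_per_day = TOTAL_TWEETS / DAYS

# Tweets per user per day, about 0.15.
tweets_per_user_per_day = tweets_per_user / DAYS
```

On these inputs, the figure of about 55 tweets per user is a total over the collection window, and the daily average of about 9,278 tweets follows from dividing the grand total by 369 days.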

During the collection of tweet data, it was assumed that one active account pertains to one 
individual only and that the sentiments of the tweets are a good proxy for the sentiment of the 
account users at the time of posting. It was also assumed that the number of pretenders 
and posers, although they exist, is too small to skew the general polarity of the tweets of 
the population. This assumption was adopted because neither an official study by researchers 
nor a statement from the SNS owners estimating the number of pretenders and posers is available. 
This absence is understandable: the social networking site is highly dynamic, so estimating this 
number would take considerable computational resources with little financial benefit to either the 
account users or the SNS owners. The process is regarded as a gargantuan task, if not an 
impossible one. 


4.3. Computing the CSP of Tweets via pClass 

Figure 4 shows the conceptual relationships among the different computer programs and pClass 
in computing the CSP of tweets. Here, pClass extracts the tweets from the RDBMS via Perl’s 
DBI library. Each tweet is then processed using NLP, and each word is compared with the words in 
the English or Filipino lexicon. The lexicon is a simple look-up table that provides the possible 
sentiment of a given word. Once the CSP of the tweet has been estimated, pClass updates the 
tweet in the RDBMS with its computed CSP following the simple averaging technique discussed in 
Section II-C. 
Figure 5 shows the daily total number of non-sentiment (or neutral) and sentiment tweets. The 
average daily number of non-sentiment tweets is 5,288 (σ = 1,238) while the average daily number 
of sentiment tweets is 3,987 (σ = 934). On average, the neutral tweets outnumber the sentiment 
tweets on a daily basis by about 1,300, except on some aggregate dates. In particular, there are 
three aggregate dates where the number of sentiment tweets exceeds the number of neutral tweets: 
(1) around the start of February 2013; (2) around the second week of November 2013; and (3) towards 
the third week of January 2014. Notice that the two FKEs occurred on 02 February 2013 and 16 January 
2014. A 2013 event that is memorable to most Filipinos is 07-08 November 2013, 
when the world’s strongest tropical cyclone Haiyan/Yolanda hit the Philippines. This observed 
pattern suggests that the people in the Taal area expressed their sentiments more during an 
FKE, an event that directly affected them. The extraordinary destruction brought about by 
Supertyphoon Haiyan/Yolanda also made the people express their sentiments even though they 
were not directly physically affected by it. 
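
Aggregate dates of this kind can be located programmatically by grouping consecutive days on which sentiment tweets outnumber neutral ones. The sketch below is an illustration only: the function name and the daily counts are invented, and the paper identified these dates by visual inspection of the plots.

```python
from datetime import date, timedelta


def exceedance_runs(daily):
    """Group consecutive days where sentiment tweets exceed neutral tweets.

    daily is a list of (day, sentiment_count, neutral_count) sorted by day.
    Returns a list of (first_day, last_day) pairs.
    """
    runs = []
    start = prev = None
    for day, sentiment, neutral in daily:
        if sentiment > neutral:
            if start is not None and day - prev == timedelta(days=1):
                prev = day  # extend the current run
            else:
                if start is not None:
                    runs.append((start, prev))
                start = prev = day  # open a new run
        elif start is not None:
            runs.append((start, prev))  # close the run on a normal day
            start = prev = None
    if start is not None:
        runs.append((start, prev))
    return runs


# Invented daily counts: (day, sentiment tweets, neutral tweets).
SAMPLE_DAYS = [
    (date(2013, 2, 1), 5000, 3000),
    (date(2013, 2, 2), 6000, 2000),
    (date(2013, 2, 3), 1000, 4000),
    (date(2013, 2, 4), 7000, 1000),
]

runs = exceedance_runs(SAMPLE_DAYS)
```

With the sample data, the first two days form one run and the last day forms a run of its own, since the third day falls back below the neutral count.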


[Figure 4 diagram. Blocks: RDBMS, pClass, DBI, Perl, Cronie, English Sentiment Lexicon, Filipino 
Lexicon, Scientific Linux V6.4, Microcomputer System.] 

Figure 4: The conceptual relationship and data exchange among pClass, RDBMS, Perl, Cronie, 
and the lexicons: (1) pClass extracts tweets from RDBMS via DBI; (2) pClass consults 
the lexicon via a system call; (3) pClass reads the sentiment from the lexicon; and 
(4) pClass updates the tweet record in the RDBMS with its estimated sentiment. 



[Figure 5 plot. x-axis: Scraping Period (2013-2014).] 

Figure 5: The daily total number of non-sentiment and sentiment tweets as computed by pClass. 
Black bars are non-sentiment and red bars are sentiment tweets. 
4.4. Temporal Analysis of Tweet Sentiments 

Figure 6 shows the daily total positive and negative polarity tweets from 27 January 2013 to 30 
January 2014. In general, the daily positive polarity tweets dominate the negative polarity ones 
except for two groups of aggregate dates: (1) from 29 January 2013 to 03 February 2013; and 
(2) from 22 to 28 January 2014. The 02 February 2013 FKE is within the first aggregate while 
the 16 January 2014 FKE is within the second. Equations (1) and (2) present the linear equations 
representing the total daily negative ($Y_{neg}$) and positive ($Y_{pos}$) polarity tweets as a 
function of the day of the year ($X$). The slopes of both equations are not significantly different 
from zero. This suggests that throughout the year, both values are fairly constant. 

$$Y_{neg} = 0.16^{ns} X + 857.13, \quad r^2 = 0.04 \qquad (1)$$

$$Y_{pos} = -0.26^{ns} X + 3151.34, \quad r^2 = 0.03 \qquad (2)$$


[Figure 6 plot. x-axis: Scraping Period (2013-2014).] 

Figure 6: The daily total number of positive and negative CSP tweets. Orange bars are positive 
CSP tweets and blue bars are negative CSP tweets. 

Even though the linear trends in Equations (1) and (2) suggest fairly constant total daily negative 
and positive tweets, the respective linear trends over the six days before an FKE suggest a linearly 
increasing negative sentiment and a linearly decreasing positive sentiment for both groups 
of aggregate dates. Equations (3) and (4) show the respective trends for the negative and positive 
sentiments of the 02 February 2013 FKE, while Equations (5) and (6) represent the respective trends 
of the negative and positive sentiments during the 16 January 2014 FKE. Without loss of generality, 
the independent variable $X$ in Equations (1) and (2) was replaced with $W$ in Equations (3) through 
(6), which now represents the reversed number of days before the observed FKE. 
$$Y_{neg,02FEB2013} = 915.63^{**} W + 366.13, \quad r^2 = 99.50 \qquad (3)$$

$$Y_{pos,02FEB2013} = -579.00^{**} W + 3809.00, \quad r^2 = 97.01 \qquad (4)$$

$$Y_{neg,16JAN2014} = 485.49^{**} W + 2014.80, \quad r^2 = 99.78 \qquad (5)$$

$$Y_{pos,16JAN2014} = -119.60^{**} W + 1843.27, \quad r^2 = 98.29 \qquad (6)$$
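
Linear trends of this form can be obtained by ordinary least squares. The sketch below is a generic implementation, not the authors' procedure: it returns the slope, intercept, and coefficient of determination as a fraction between 0 and 1, and the daily counts in the example are invented rather than taken from the collected data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared), with r_squared as a fraction.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A constant series is fitted perfectly by a flat line.
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared


# Invented counts for the six days leading up to an event (W = 1..6).
days = [1, 2, 3, 4, 5, 6]
negative_counts = [1300, 2150, 3200, 4000, 4900, 5900]
positive_counts = [3250, 2600, 2100, 1500, 900, 350]

neg_slope, neg_intercept, neg_r2 = fit_line(days, negative_counts)
pos_slope, pos_intercept, pos_r2 = fit_line(days, positive_counts)
```

A positive slope for the negative counts together with a negative slope for the positive counts is the pattern that the trend equations describe for the days before each FKE.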


Notice here that the trends of the negative sentiments for both FKEs have positive slopes, suggesting 
an increasing number of negative sentiments up to the day of the FKE (Figure 7). Concurrently, 
the trends of the positive sentiments for both FKEs have negative slopes, showing that the number 
of positive sentiments decreased significantly up to the day of the FKE. 

There seems to be a pattern of resiliency exhibited in Figure 7 by the trend of the number of 
negative and positive sentiments merely two days after an FKE. Notice that just a day after 
the FKE, the number of negative sentiments decreased while the number of positive ones increased 
for both FKEs. Two days after the FKE, the positive sentiments started to outnumber the negative 
ones. 

In Figure 6, other than the two aggregate dates that encompass the two FKEs, there is an observable 
spike in the number of negative tweets that occurred from 10 to 14 November 2013. Although this 
spike did not exceed the number of positive tweets within the same period, its occurrence is of 
interest because it was observed two days after Supertyphoon Haiyan/Yolanda struck 
the central Philippines on 07-08 November 2013. Even though the typhoon did not directly physically 
affect the residents of Taal, the outpouring of sentiment suggests that they were affected 
emotionally. The number of positive sentiments dominating the number of negative sentiments 
within these dates may show that the people in Taal were sending positive sentiments. A possible 
(intuitive) explanation for this observation is that the people in Taal were hopeful that those 
affected by the supertyphoon would “rise up from the aftermath.” The semantics of the tweets coming 
out of the textual analysis that support this intuition will be presented in another forum. 
Figure 7: The respective linear trends of the negative and positive sentiment tweets around (a) the 
02 February 2013 FKE, and (b) the 16 January 2014 FKE. 


5. Summary and Conclusion 

This paper presented a system for sensing in real time the sentiment of a population experiencing 
the onset of an FKE in Taal Lake. The FKE is seen as a rare event owing to the frequency of its 
occurrence vis-a-vis the frequency of its non-occurrence. Developing a model that provides 
high-fidelity prediction of when and where in the vast expanse of the lake an FKE will occur 
using standard regression and time series techniques has proven difficult because of the 
unbalanced nature of the dataset, complicated further by missing data. Modeling 
techniques that rely on machine learning and predictive analytics, although successful in providing 
an acceptable accuracy rate, result in models that require an expensive network of sensors 
installed not only on the surface of the vast expanse of the lake but also several meters 
deep along the lake’s depths. Investing in such a large network of sensors was foreseen to be costlier 
than the savings that may be obtained from avoiding the FKE. With the pervasive nature of ICT, 
which reaches even the most remote corners and is used by the least expected people, previous 
studies found that the CSP of Twitter posts can be used to track natural disasters and 
social phenomena in real time. The system discussed in this work, with the aid of TAPI, accesses 
the texts of dated and geolocated posts on the social networking site Twitter and computes the 
corresponding CSPs. Based on the collected data, a marked increase in negative CSP co-occurred 
with the two FKEs that separately struck Taal Lake on 02 February 2013 and on 16 January 2014. This 
co-occurrence of seemingly unrelated events suggests that CSP from Twitter may be 
used to augment predictive models that require expensive sensing infrastructure. 